[python] Add $buckets system table #7989
leaves12138
left a comment
Thanks for adding the Python $buckets system table. The Python-side implementation and tests look reasonable, but this PR currently reverts the recent SubstringTransform fix from #7987.
In paimon-common/src/main/java/org/apache/paimon/predicate/SubstringTransform.java, the diff changes the null check from the referenced source field index back to column 0:
sourceString = row.isNullAt(0) ? null : row.getString(sourceFieldRef.index());
and it also removes testSubstringRefInputUsesSourceFieldNullability. This reintroduces the bug where SUBSTRING(FieldRef(index > 0), ...) returns null whenever column 0 is null, even if the actual referenced source column is non-null. That is unrelated to $buckets and would regress existing transform behavior.
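To make the regression concrete, here is a toy Python model of the two null checks. It is not the Java code: the list-based row, the 1-based `begin` argument, and both function names are illustrative assumptions.

```python
from typing import List, Optional


def substring_buggy(row: List[Optional[str]], source_index: int,
                    begin: int, length: int) -> Optional[str]:
    # Regressed behavior: nullability is checked on column 0,
    # not on the column the field reference points at.
    source = None if row[0] is None else row[source_index]
    if source is None:
        return None
    return source[begin - 1:begin - 1 + length]


def substring_fixed(row: List[Optional[str]], source_index: int,
                    begin: int, length: int) -> Optional[str]:
    # Fixed behavior: nullability is checked on the referenced column.
    source = row[source_index]
    if source is None:
        return None
    return source[begin - 1:begin - 1 + length]


# Column 0 is null, but the referenced column 1 is not.
row = [None, "hello"]
buggy_result = substring_buggy(row, 1, 1, 3)
fixed_result = substring_fixed(row, 1, 1, 3)
```

With this input the buggy variant yields `None` while the fixed variant yields `"hel"`, which is the case the removed test was guarding.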
Please rebase/merge current master and keep #7987's sourceIndex fix and test, or remove the Java SubstringTransform changes from this PR. I ran the new Python tests locally and they passed:
PYTHONPATH=. python -m pytest -q pypaimon/tests/system/buckets_table_test.py pypaimon/tests/system/system_table_loader_test.py
# 10 passed
@leaves12138 Thanks for catching this! Rebased onto latest master; the SubstringTransform changes are no longer included in this PR.
+1
Summary
pypaimon currently implements 8 system tables: $snapshots, $schemas, $options, $manifests, $files, $partitions, $tags, $branches. Compared to the Java side, it still lacks $buckets, $audit_log, $read_optimized, $consumers, $statistics, $aggregation_fields, $file_key_ranges, $table_indexes, etc.

This PR adds $buckets, which is one of the more commonly used system tables for diagnosing data skew. It aggregates manifest entries by (partition, bucket) and exposes per-bucket record_count, file_size, file_count, and last_update_time.

Changes
buckets_table.py: BucketsTable implementation
buckets_table_test.py: 5 end-to-end tests (schema validation, empty snapshot, aggregation correctness, sort order, catalog dispatch)
system_table_loader.py: register "buckets"
system_table_loader_test.py: update expected table list
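
For readers unfamiliar with the table, the per-bucket aggregation described in the summary can be sketched roughly as follows. This is a standalone illustration, not the code in buckets_table.py: the `Entry` dataclass, its field names, the `aggregate_buckets` helper, and the sort by (partition, bucket) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one data file listed in a manifest."""
    partition: Tuple[str, ...]
    bucket: int
    row_count: int
    file_size: int
    creation_time_ms: int


def aggregate_buckets(entries: Iterable[Entry]) -> List[dict]:
    """Group entries by (partition, bucket) and roll up per-bucket stats."""
    acc: Dict[Tuple[Tuple[str, ...], int], dict] = {}
    for e in entries:
        key = (e.partition, e.bucket)
        stats = acc.get(key)
        if stats is None:
            # First file seen for this bucket.
            acc[key] = {
                "record_count": e.row_count,
                "file_size": e.file_size,
                "file_count": 1,
                "last_update_time": e.creation_time_ms,
            }
        else:
            stats["record_count"] += e.row_count
            stats["file_size"] += e.file_size
            stats["file_count"] += 1
            stats["last_update_time"] = max(
                stats["last_update_time"], e.creation_time_ms)
    # Assumed ordering: by partition, then bucket.
    return [
        {"partition": p, "bucket": b, **stats}
        for (p, b), stats in sorted(acc.items(), key=lambda kv: kv[0])
    ]


rows = aggregate_buckets([
    Entry(("2024-01-01",), 0, 100, 1024, 1000),
    Entry(("2024-01-01",), 1, 10, 256, 1500),
    Entry(("2024-01-01",), 0, 50, 512, 2000),
])
```

A bucket whose record_count or file_size is far larger than its siblings in the same partition is the kind of skew this table is meant to surface.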